GWAS reveals determinants of mobilization rate and dynamics of an active endogenous retrovirus of cattle

Five to ten percent of mammalian genomes is occupied by multiple clades of endogenous retroviruses (ERVs), that may count thousands of members. New ERV clades arise by retroviral infection of the germline followed by expansion by reinfection and/or retrotransposition. ERV mobilization is a source of deleterious variation, driving the emergence of ERV silencing mechanisms, leaving “DNA fossils”. Here we show that the ERVK[2-1-LTR] clade is still active in the bovine and a source of disease-causing alleles. We develop a method to measure the rate of ERVK[2-1-LTR] mobilization, finding an average of 1 per ~150 sperm cells, with >10-fold difference between animals. We perform a genome-wide association study and identify eight loci affecting ERVK[2-1-LTR] mobilization. We provide evidence that polymorphic ERVK[2-1-LTR] elements in four of these loci cause the association. We generate a catalogue of full length ERVK[2-1-LTR] elements, and show that it comprises 15% of C-type autonomous elements, and 85% of D-type non-autonomous elements lacking functional genes. We show that >25% of the variance of mobilization rate is determined by the number of C-type elements, yet that de novo insertions are dominated by D-type elements. We propose that D-type elements act as parasite-of-parasite gene drives that may contribute to the observed demise of ERV elements.

encompassing the ERVK insertion (black triangle) in the coding exon 5 of the APOB gene; with from top to bottom: the whole genome sequence (WGS) BAM files of a homozygous wild-type (1) and a heterozygous carrier (2) with reads from discordant read pairs surrounding the ERVK insertion showing a multicolor display; (1)! (2)!
ERV2-1-LTR-BT! a b c d liver cDNA sequence reads (light blue) for a homozygous wild-type (3) and a homozygous mutant (4) highlighting the transcriptional shut-off in APOB exon 5; genomic organization of the APOB gene with exons and introns displayed as thick and thin dark blue rectangles respectively (5).(b) IGV screen capture of a zoomed 500 bp region surrounding the ERVK insertion site; with -from top to bottom: the WGS of a heterozygous carrier (1) with target site duplication (TSD) pinpointed by a black triangle; reads from the allele without the insertion are shown in grey; reads from discordant read pairs surrounding the breakpoints are showing a multicolor display and reads overlapping the breakpoints a characteristic softclipped feature (lack of homology with the reference sequence); cDNA sequence reads (light blue) for a homozygous mutant (2) highlighting the transcriptional shut-off in APOB exon 5 and reads overlapping the breakpoint showing a characteristic soft-clipped feature.(c) Liver cDNA sequence from a 3'RACE experiment performed on a homozygous mutant; with coding exon 4 and 5 in dark blue; the TSD is shown in blue, bold and underlined; from the ERVK insertion site, the transcribed part of its 5' LTR sequence is presented in black, ending with a poly(A) tail preceded by canonical TATA box and 'AATAAA' poly(A) signal (underlined); the sequence boxed in dark grey corresponds to the Sanger cDNA sequence trace displayed in panel d.(d) Sanger cDNA sequence trace from the above 3'RACE experiment, highlighting TATA box, poly(A) signal and poly(A) tail with black horizontal bars; a schematic representation of the full-length ERVK insertion in APOB exon 5 is depicted above the sequence trace.
Supplementary Figure 2: Genomic features of polymorphic ERV elements.(a) Proportional representation of ERVK sub-groups in genome space (light grey bars) and amongst polymorphic elements detected by LocaTER (dark grey bars).Repbase reports 33 subgroups of ERVK, of which the most abundant in the reference genome are BTLTR1 (38.9%),ERVK[2-1-LTR] (20.2%),BLTR1B (10.2%) and BLTR1J (7.1%).While BTLTR1 is underrepresented amongst polymorphic ERVK elements (30.1%),ERVK[2-1-LTR] (26.6%),BLTR1B (27.1%) and BLTR1J (8.8%) are respectively overrepresented 1.3, 2.6 and 1.2 times.Of note, the very rare ERVK[2-3-LTR] subgroup (0.5%) is 4 times overrepresented amongst polymorphic ERVK elements.This suggests that the latter four subgroups, especially, might still be active.(b) Genomic distribution of polymorphic ERVK elements (dark grey) compared to the corresponding genome space (light grey).Left: intergenic versus genic space.Right: antisense versus sense orientation for genic elements.(c) Genomic distribution of polymorphic ERV elements other than the ERVK group (dark grey) compared to the corresponding genome space (light grey).Left: intergenic versus genic space.Right: antisense versus sense orientation for genic elements.The respective proportions did not differ significantly between ERVK and non-ERVK elements.See also main text.(d) Derived Allele Frequency (DAF) spectrum for genic (dark grey) and intergenic (light grey) ERV elements.DAF of genic insertions are slightly (p = 0.06) shifted towards lower values as expected under purifying selection.(e) Derived Allele Frequency (DAF) spectrum for genic antisense (dark grey) and genic sense (light grey) ERV elements.There is no significant evidence (Wilcoxon sum of rank test) p = 0.99) for a shift of DAF of sense insertions towards lower values when compared to antisense, as expected if they would be subject to stronger purifying selection.Source data are provided as a Source Data file.Upper logo: nucleotide composition flanking 3669 ERVK[2-1-LTR] insertion sites.The height of the column corresponds to the departure from uniform nucleotide composition (measure of entropy).It shows strong signals for positions 1 (G) and 6 (C) of the duplication, as well as positions -3 (A), -2 (T), +2 (T) and +3 (A), as well as weaker up-and down-stream signals that appear to show a 10-bp (one helical turn) periodicity.Lower logo: log(1/p) value of the enrichment/depletion of the four nucleotides with respect to the average across a 100 bp window centered on the insertion.Pvalues were determined using z-scores computed with the mean and standard deviation of the abundance of the corresponding nucleotide in the window.The signal was strongest for positions 1 (G) and 6 (C) of the duplication, as well as positions -3 (A), -2 (T), -1 (G), +1 (C), +2 (T), +3 (A), and +5 (T).Although palindromic in appearance (5'ATGG….CCAT-3') when considering average base pair composition, there was no evidence for palindromicity when considering individual sequences as reported by Kirk et al. (2016).Logos were plotted using http://kplogo.wi.mit.edu/.Source data are provided as a Source Data file.and a D ERVK[2-1-LTR] element are displayed below the dot plot.Decreasing similarity (attrition) is marked by arrows with gradients (from high to low similarity).The arrowheads mark the boundaries of the GAG and ENV shared sequences used to generate the tree shown in b.(b) Indeed, these segments do not show obvious "within segment" traces of recombination between C-and D-type elements so should provide the most correct relationship between C-and D-type (see hereafter for one "between segment" recombination).Insertion-deletions of more than one nucleotide were collapsed to single events.C-type elements are highlighted in blue.C-and D-type elements appear as clear distinct clades.One C-element (chr.24) is closer to the D-clade.It corresponds to the only element that has a complete C-type GAG-shared sequence associated with a D-type ENV-shared sequence.The average pairwise difference between C-and D-type elements was 38.4 in 250 base-pairs.Assuming a de novo mutation rate of 1x10 -8 base pairs per generation [30] , this would correspond to ~15 million generations or ~50 million years, hence suggesting that the endogenization of the exogenous retrovirus precursor of the ERVK[2-1-LTR] element is a very ancient event.It is, however, possible that some of the differences between C-and Dtype shared sequences were introduced upon creation of D-type elements, and/or that the mutation rate for ERV elements, influenced by the retro-transposition process, is higher than 1x10 -8 base pairs per generation.Estimates of the divergence are also influenced by the somewhat arbitrary definition of the boundaries of the GAG-shared and ENV-shared segments.protein sequences.Number of amino-acids is given above each ORF (GAG,722;PRO,310;POL,885;ENV,712).The corresponding amino acid sequence is highlighted with the same color code below each domain scheme.Amino acids in italics at the N-terminal part of the PRO protein correspond to putative translation start sites.See also Supplementary Table 1 for a more detailed description of each protein domain.(chr24:51,127,026; middle connected red-blue-red rectangles).Their corresponding genomic sequence is given below with the same order and color code.homozygous (no reads bridging the 5' and 3' insertion sites) in the individual.The number of 5' improperly paired reads, and 3' improperly paired reads are then tested against the genome average for significance and the site retained if either are significantly different.Each site is annotated with any gene it overlaps, and the ERV class with the most mates aligned to it is selected as the likely class of the new ERV insertion.The pedigree for the population is then analyzed to identify trios and each site is checked for any violations of Mendelian inheritance (absent in both parents but present in the proband).Each site is then reported with the associated statistics and list of identified carriers, along with their likely genotype.

Supplementary Figure 1 :
Identification and functional validation of the causative mutation for CD.(a) Integrative Genome Viewer (IGV) screen capture of a 10 kb genomic region

Figure 3 : 4 :
Pedigree-based identification of five de novo insertions of ERVK[2-1-LTR] elements.(a) IGV screen captures of the WGS corresponding to the respective genomic regions surrounding each of the five de novo insertions (1 to 5); with -from top to bottom -absence of the ERVK[2-1-LTR] element in the parents, presence in the offspring and transmission to grand-offspring in perfect linkage disequilibrium; a dark blue border highlights the family with two insertions transmitted by the same sperm cell; blue and pink backgrounds feature de novo events occurring in the paternal or maternal germline respectively; '+' for wild-type allele and 'I' for allele with insertion.Sires and Dams are numbered as in main Fig. 2d.(b) Schematic drawing of de novo events occurring either in the paternal (left) or maternal (right) germline.de novo event (I P )! in the sire's germ line!Assessing the repeatability and robustness of PCIP and lack of evidence of an effect of a bull's inbreeding coefficient on the rate of ERVK[2-1-LTR] mobilization in its germline.(a) Correlation between the 5' and 3' PCIP estimates of the mobilization rate of landscape of ERVK[2-1-LTR] de novo insertions.(a) Distribution of ERVK[2-1-LTR] insertions across standardized chromosome length.Chromosomes were split in 100 equally sized bins; the number ERVK[2-1-LTR] insertions counted for each bin, divided by the total number insertions on that chromosome, and multiplied by 100.Bin-specific percentages were then averaged across the 29 autosomes.A regression curve was fitted to the data using LOESS.ERVK[2-1-LTR] insertions appear to preferentially occur towards the telomere.Of note, all bovine autosomes are acrocentric.Thus, bin 1 is close to the centromere, and bin 100 close to the telomere.(b) Effect of local GC content on the rate of ERVK[2-1-LTR] insertion.The genome was subdivided in 2505 nonoverlapping 1Mb bins.GC content was computed for each bin, and corrected (using a linear model) for distance from chromosome center.We then looked at the distribution of GC content for bins with 0, 1, 2, … 12 ERVK[2-1-LTR] insertions.There was a striking positive correlation between GC content and the number of insertions (r= 0.08, p=2.7x10 -5 ).A rectangle is drawn to represent the second and third quartiles with a horizontal line inside to indicate the median value.(c) 6-bp duplication (grey box) and 8-bp pseudo-palindromic motif at ERVK[2-1-LTR] insertion sites.

Figure 9 :
Protein sequences and domain annotation of a representative element of the C-clade (chr19: 50,466,809 bp).Schematic representation of the four ORFs' domain annotation according to specific hits and super-families obtained by blasting each ORF against 'viruses' (taxid: 10239) non-redundant

Figure 10 :
Examples of APOBEC3 mutational signature, recombination events and attrition at the boundaries.(a) Illustration of APOBEC3 induced G-to-A transitions on the plus-strand of the ERVK[2-1-LTR] genetic code: IGV screen capture of genomic sequence reads obtained from PCR amplicons of two clade 'C' ERVK[2-1-LTR] elements (chr4:119,518,125: top panel and chr24:51,127,026: bottom panel) mapped on the ~10Kb (5' to 3') reference genomic sequence of a representative clade 'C' element (chr19: 50,466,809; top GWAS signal).Green vertical bars correspond to G-to-A transition and each one is highlighted by a green triangle (n = 11, chr4; n = 16, chr24).Other polymorphisms are shown with vertical color bars.Size and location of a double recombination are positioned by a red horizontal bar (163nt).(b) Schematic representation of a 163nt double recombination event between a clade 'C' (chr19: 50,466,809; top connected red rectangles) and a clade 'D' (chr8: 37,164, 679; bottom connected blue rectangles) ERVK[2-1-LTR] elements resulting in a recombinant 'C' element HI has also been observed as adjunct domains to the reverse transcriptase gene in retroviruses, in long-term repeat (LTR)bearing retrotransposons and non-LTR retrotransposons.RNase HI in LTR retrotransposons perform degradation of the original RNA template, generation of a polypurine tract (the primer for plus-strand DNA synthesis), and final removal of RNA primers from newly synthesized minus and plus strands.The catalytic residues for RNase H enzymatic activity, three aspartatic acids and one glutamic acid residue (DEDD), are unvaried across all RNase H domains. Phylogenetic patterns of RNase HI of LTR retroelements is classified into five major families, Ty3/Gypsy, Ty1/Copia, Bel/Pao, DIRS1 and the vertebrate retroviruses.Bel/Pao family has been described only in metazoan genomes.RNase H inhibitors have been explored as an anti-HIV drug target because RNase H inactivation inhibits reverse transcription.449-5711.63E-26 rve pfam00665Integrase core domain (IN);Integrase mediates integration of a DNA copy of the viral genome into the host chromosome.Integrase is composed of three domains.The amino-terminal domain is a zinc binding domain pfam02022.This domain is the central catalytic domain.The carboxyl terminal domain that is a non-specific DNA binding domain pfam00552.The catalytic domain acts as an endonuclease when two nucleotides are removed from the 3' ends of the blunt-ended viral DNA made by reverse transcription.This domain also catalyzes the DNA strand transfer reaction of the 3' ends of the viral DNA to the 5' ends of the integration site 633-727 7.17E-24IN_DBD_C pfam00552Integrase DNA binding domain (IN); Integrase mediates integration of a DNA copy of the viral genome into the host chromosome.Integrase is composed of three domains.The amino-terminal domain is a zinc binding domain.The central domain is the catalytic domain pfam00665.This domain is the carboxyl terminal domain that is a non-specific DNA binding ; This family includes envelope protein from a variety of retroviruses.It includes the GP41 subunit of the envelope protein complex from human and simian immunodeficiency viruses (HIV and SIV) which mediate membrane fusion during viral entry.The family also includes bovine immunodeficiency virus, feline immunodeficiency virus and Equine infectious anaemia (EIAV).The family also includes the Gp36 protein from mouse mammary tumor virus (MMTV) and human endogenous retroviruses (HERVs).512-7061.34E-35

Comparison between the real (dark grey) and simulated (lighter greys) frequency distribution of the resampling rate of ERVK[2-1-LTR] de novo mobilization events in BB bulls BE157971524, BE187351114, BE63811423.
9838rich in positively charged basic residues (basic domain in red) interacting with the membrane phospholipids.The score for myristoylation was predicted by Myristoylator (https://web.expasy.org/cgi-bin/myristoylator).(c)Consensusprimer binding site (PBS) shared by C and D-type elements aligned to the 3' end of the bovine CTT and TTT lysine tRNA genes.The simulations assumed that mobilization occurred in a developmental window centered on cell generation 14 of spermatogenesis of 21 (yielding 1,048,576 spermatogonial stem cells).Windows spanning 5, 7 or 9 cell generations, with mobilization rates distributed as shown in Fig.8b, matched the real data (3 bulls combined) equally well.The simulations further matched the real data with regards to bull-specific de novo insertion frequency (average number of de novo insertions per sperm cell) and number of explored haploid genomes (see Methods).Source data are provided as a Source Data file.

Table 1 : Detailed domain annotation of a representative element of the C-clade (chr19: 50,466,809 bp).
Specific hits and super-families were obtained by blasting each ORF (GAG, PRO, POL, ENV) against 'viruses' (taxid: 10239) non-redundant protein sequences.This is a subfamily of retropepsins.The family includes pepsin-like aspartate proteases from retroviruses, retrotransposons and retroelements.While fungal and mammalian pepsins are bilobal proteins with structurally related N-and C-termini, retropepsins are half as long as their fungal and mammalian counterparts.The monomers are structurally related to one lobe of the pepsin molecule and retropepsins function as homodimers.The active site aspartate (D) occurs within a motif (Asp-Thr/Ser-Gly), as it does in pepsin.Retroviral aspartyl protease is synthesized as part of the POL polyprotein that contains an aspartyl protease, a reverse transcriptase, RNase H, and an integrase.The POL polyprotein undergoes specific enzymatic cleavage to yield the mature proteins.In aspartate peptidases, Asp residues are ligands of an activated water molecule in all examples where catalytic residues have been identified.This domain is found in a number of RNA binding proteins, and is also found in proteins that contain RNA binding domains.This suggests that this domain may have an RNA binding function.This domain has seven highly conserved glycines.Reverse transcriptases (RTs) from retroviruses (Rtvs).RTs catalyze the conversion of single-stranded RNA into double-stranded viral DNA for integration into host chromosomes.Proteins in this subfamily contain long terminal repeats (LTRs) and are multifunctional enzymes with RNA-directed DNA polymerase, DNA directed DNA polymerase, and ribonuclease hybrid (RNase H) activities.The viral RNA genome enters the cytoplasm as part of a nucleoprotein complex, and the process of reverse transcription generates in the cytoplasm forming a linear DNA duplex via an intricate series of steps.Type 1 and Type 2, based on amino acid sequence similarities and biochemical properties.RNase H is an endonuclease that cleaves the RNA strand of an RNA/DNA hybrid in a sequence non-specific manner in the presence of divalent cations.RNase H is widely present in various organisms, including bacteria, archaea and eukaryote.RNase